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Abstract. We give new algorithms for the computation of square roots and 
reciprocals of power series in C[x]. If M(n) denotes the cost of multiplying 
polynomials of degree n, the square root to order n costs (1.333 . . . + o(l))Af (n) 
and the reciprocal costs (1.444 . . . + o(l))M(n). These improve on the previous 
best results, respectively (1.8333 . . . + o(l))M(n) and (1.5 + o(l))M(n). 



1. Introduction 

It has been known for some time that various operations on power series, such 
as division, reciprocal, square root, and exponentiation, may be performed in a 
constant multiple of the time required for a polynomial multiplication of the same 
length. More recently the focus has been on improving the constants. A wealth of 
historical and bibliographical information tracking the downward progress of these 
constants may be found in Bcr04J and Bcr08 . In this paper we present results that 
improve on the best known constants for the square root and reciprocal operations. 

Let M(n) denote the cost of multiplying two polynomials in C[x] of degree less 
than n. By 'cost' or 'running time' we always mean number of ring operations in 
C. We assume FFT-based multiplication throughout, so that M(n) = 0(n\ogn). 

Let / € C[xJ, / = 1 mod x. There are two algorithms for computing / _1 mod 
x n that achieve the previously best known running time bound of (1.5 + o(l))M(n). 
The first is that of Schonhage [SchOOl Theorem 2]. If g is an approximation to / 
of length k, then the second-order Newton iteration g' — (2g — g 2 (f mod a; 2 *)) 
yields an approximation to / _1 of length 2k. Schonhage observed that it suffices 
to compute g 2 (f mod x 2k ) modulo x 3k — 1, which can be achieved by two forward 
FFTs and one inverse FFT of length 3k. Iterating this process, the running time 
bound follows easily. Bernstein's 'messy' algorithm Ber04, p. 10] is more compli- 
cated. Roughly speaking, he splits the input into blocks of consecutive coefficients, 
and applies a second-order Newton iteration at the level of blocks. This blocking 
strategy allows transforms of blocks to be reused between iterations. 

Our new reciprocal algorithm may be viewed as a simultaneous generalization of 
Bernstein's reciprocal algorithm and van der Hoeven's division algorithm |vdH06, 
p. 6]. The main innovation is the use of a third-order Newton iteration, whose 
additional term is computed essentially free of charge, leading to a running time of 
(1.444 . . . + o(l))M(n) (Theorem[S]). Although this is only a small improvement, it 
is interesting theoretically because the 'nice' bound (1.5 . . . + o(l))M (n), achieved 
by two quite different algorithms, had been a plausible candidate for the optimal 
bound for almost ten years. Furthermore, the methods presented in this paper 
suggest that (1.333... + o(l))M(n) may be attainable (see the final remark in 
Section [5|) . 
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For the square root, there are again two contenders for the previously best known 
bound of (1.8333 . . . + o(l))M(n). Bernstein computes the square root and recip- 
rocal square root together, alternately extending approximations of each [Bcr04 ( 
p. 9]. Hanrot and Zimmermann first compute the reciprocal square root to half the 
target precision, using a technique similar to Schonhage's, and then apply a differ- 
ent iteration at the last step to obtain the square root [HZ04 . (They claim only 
1.91666. . . , but there is an error in their analysis; the cost of line 3 of Algorithm 
'SquareRoot' is M(n)/3, not M(n)/2.) 

Our new square root algorithm achieves (1.333 . . . + o(l))M(n) (Thcorcm[3]). It is 
quite different to both of the above algorithms. It operates on blocks of coefficients, 
and may be viewed as a straightforward adaptation of van der Hoeven's division 
algorithm to the case of extracting square roots. 

For simplicity, in this paper we only discuss the case of C[xJ. It seems likely 
that the algorithms may also be adapted to achieve the same constants in the case 
of power series over an arbitrary ring, and also in the case of arbitrary-precision 
integers or real numbers, but we have not checked the details. 

2. Notation and complexity assumptions 

The Fourier transform !F n {g) G C™ of a polynomial g G C[x] is defined by 
{J- n (g))j = g(e 2m ^ n ). If degg < n, we denote by T(n) the cost of computing 
T n (g) from g, or of computing g from J- n {g). 

If gi,(72 G C[x] and deggi < n, the cyclic convolution gig 2 mod i" — 1 may 
be computed by evaluating J r ~ 1 (T n {gi)J~n{g2)), where the Fourier transforms are 
multiplied componentwise. The running time is 3T(n) + O(n). To obtain the 
ordinary product g\g 2 one may compute 3152 mod x 2n — 1 for any n' > n, leading 
to the estimate M(n) = 3T(2n')+0(n'). While it is known that T(n) = 0(n\ogn) 
for all n, for sufficiently smooth n the implied big-O constant may be smaller 
than the worst case, and one should choose n' to take advantage of this. We 
therefore assume that we have available a set S C Z + with the following properties: 
first, that T(2n) = (1/3 + o(l))M(n) for n G S, and second, that the ratio of 
successive elements of S approaches 1. The choice of S will depend on exactly 
which FFT algorithms are under consideration. For example, Bernstein describes 
a particular class of FFT algorithms for which the above properties hold with 
S = {2 k m : to odd and k > m 2 — 1} [Ber04, p. 5]. Following Bernstein, we call 
elements of S ultrasmooth integers. 

If <?, h G C[x], deg g < 2n, deg h < n, we denote \>y g y\ n h the middle product of 
g and h. That is, if gh = po + p±x n + p2X 2n where pi G C[x], degpi < n, then by 
definition g x n h = p\. Note that gh = (po +P2) +P\x n mod x 2n — 1, so g x„ h may 
be computed by evaluating T^n {^2n{g)^2n{h)) and discarding the first half of the 
output. See [BLS03] and |HQZ04| for more information about the middle product. 

In the algorithms given below, we fix a block size m > 1, and for / G C[x], we 
write / = /[o] + f [x] X + f [2 ]X 2 H , where X = x m and deg/ w < to. 

3. Blockwise multiplication of power series 

Our main tool is a technique for multiplying power series, described in the proof 
of Lemma [T] below. Bernstein used a similar idea in |Ber04l p. 10]. We follow 
van der Hoeven's more recent approach [vdH06] . which uses the middle product to 
obtain a neater algorithm. 
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Figure 1. Blockwise product of power series. Terms contributing 
to (fg)[3] are shaded. 



Lemma 1. Let f,g G CJx] and k > 0. Given as input ^ r 2m(/[o])i • • ■ ,^2m(/[fc]) 
andT 2m {g[o\), ■ ■ ■ ,^2m(9[k])> we may compute (fg)[ k ] in time T(2m)+0(m(k + l)). 

Proof. As shown in Figure [1] we have 

fe 

(fg)[k] = ^(flk-i-i] + f[k-i]X) xi m g[{\, 

i=0 

where for convenience we declare that = 0. Therefore (fg)[ k ] is obtained as 
the second half of 

i ( k 

^2m I J! (^2m(/[/t-i-l]) + ^(/[t-i])^^)) ^2m(9[i]) 

Since {T2m{X))j = (— 1) J , the above expression may be computed from the ^ r 2m(/[z] ) 
and ^ r 2m(5[i]) using 0(m{k + 1)) ring operations, followed by a single inverse trans- 
form of length 2m. □ 

Remark. In Section O we will also need to compute expressions of the form (fg + 
f'g')[k], using the transforms of the blocks of /, /', g and g' as input. Since the 
Fourier transform is linear, the same running time bound T(2m) + 0(m(k + 1)) 
applies (with a larger big-O constant). 



4. Square root 

If / = fa + fix + f2X 2 + • • • and g = f 1/2 = g Q + g x x + g 2 x 2 H , then the 

coefficients of g may be determined by solving sequentially g^ — fo, 25051 = fi, 
25o52 =f2-9x, 25053 = h~ 2.9i52, 2.90.94 = fi- (25153 +52), and so on - Algorithm 
Q] applies this procedure at the level of blocks, retaining the Fourier transform of 
each computed block as it proceeds. 

Proposition 2. Algorithm\l\is correct, and runs in time (4r — 3)T(2m) + 0(r 2 m). 

Proof. By definition g^ is correct. In the fcth iteration of the loop, assume that 
<7[ ], . . . ,5[fc_i] have been computed correctly. Then we have 

(5[o] + • • ■ + g[k-i]X k - 1 ) 2 = / [0] + • • ■ + f [k -i]X k - 1 + ^X k mod X k+1 , 

(5[o] + • • • + g[k]X k ) 2 = /[o] + • • • + /[fc-i]^- 1 + f [k] X k mod X k+1 . 
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Algorithm 1: Square root 



Input: r £ Z, r > 1 

fe CW, / = lmodx 
.9[o] = (/[o]) 1/2 modi 
^ = (.9[o]) _1 mod x 
Output: g = g [0] + ■■■+ . 9[r _ 1] X r - 1 = f 1 ' 2 mod X 1 

1 Compute !F2m(h) 

2 for 1 < fc < r do 



Compute ^2m(9[fc-i]) 

^ <- (0[o] + • • • + .9[fe-i]^ fe_1 ) 2 )[fe] 

Compute T 2m {f[ k ] - VO 

9[k] *~ \Kf[k] ~ i>) mod X 



Subtracting these yields 2<7[ ] = /[fe] ~ ^ mod X, so g^j is computed correctly. 

Each iteration performs one inverse transform to obtain ip (Lemma [1} , one to 
obtain g< k i, and the two forward transforms explicitly stated. The total number of 
transforms is therefore 4(r — 1) + 1 = 4r — 3. The loop also performs O(km) scalar 
operations in the kth iteration (Lemma [1|. □ 

Theorem 3. The square root of a power series f = 1 + fix + • • ■ G C[x] may be 
computed to order n in time (4/3 + o(l))M(n). 

Proof. Let r > 1, and let to be the smallest ultrasmooth integer larger than n/r. 
We let r grow slowly with respect to n; specifically we assume that r — > oo and 
r = o(log?i) as n — ► oo. Then to — > oo as n — ► oo, so to = (1 + o(l))n/r. Zero-pad 
/ up to length rm. Compute grgi = (/[o]) 1 ^ 2 mod a;" 1 and h = (gjo]) -1 mod x m 
using any 0(M(m)) algorithm. Compute f 1 ! 2 mod x rm , hence f 1 ^ 2 mod x n , using 
Algorithm [TJ By Proposition [2] the total cost is 

0(M(m)) + (4r - 3)T(2to) + 0{r 2 m) = (4r/3 + 0(l))M(m) + 0(r 2 m) 

= (4/3 + 0{r~ l ))M{mr) + 0{r 2 m) 

= (4/3 + 0(r- 1 ))Af(n) + 0{rn) 

= (4/3 + o(l))M(n), 

assuming that M(n) = O(nlogn). □ 

Remark. If grgi and /i are computed using Bernstein's (2.5 + o(l))M(n) algorithm 
for the simultaneous computation of the square root and reciprocal square root 
[Ber04| p. 9], then already for r = 3 the new algorithm matches the previous bound 
of (1.8333 . . . + o(l))M(n), and is strictly faster for r > 4. 

Remark. Let / £ C[i] be a monic polynomial of degree 2n. The above algorithm 
may be adapted to compute the square root with remainder, that is, polynomials 
g, h € C[x] with degg = n, deg h < n, and f = g 2 + h, in time (5/3 + o(l))M(n). 

For this, write /(x) = x 2n /(l/x), g(x) = x n g(l/x), h(x) = x n h(l/x). Then 
f,g,h e C[x], and we want to solve /(x) = <?(x) 2 + x n h(x). First compute g(x) 
using the above algorithm; to find h{x) it then suffices to compute g(x) 2 . Observe 
that at the end of Algorithm^] we may compute ((g[a] + • • • + g[ r -i]X r ~ 1 ) 2 )^ for 
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r < j < 2r in time rT(2m) + 0(r 2 m) using Lemma [TJ since the transforms of 
the g\j] are all known. This increases the cost from (4r — 3)T(2m) 4- 0(r 2 m) to 
(5r — 3)T(2m) + 0(r 2 m), leading to the claimed bound in the same way as in the 
proof of Theorem [3] 

5. Reciprocal 

Let / = 1 + fix + ■■■€. C[x], and suppose that g = / _1 mod x n . Then /<? = 
l + 5x n mod cc 3 ™ for some 6 € C[x], deg<5 < 2n. Putting g' — g(l — Sx n + 8 2 x 2n ), we 
have g' = / _1 mod x 3n . This is the third-order Newton iteration for the reciprocal. 

The idea of Algorithm [5] below is to use the above recipe at the level of blocks, 
with an additional twist. If we write S = Sq+Six" , where deg Si < n, then g' = g(l — 
S x n + (S 2 — 8i)x 2n ) mod x 3n . The algorithm first computes S , applying Lemma[TJto 
compute the relevant blocks of fg. Then, instead of computing 5$ and 5% separately, 
it computes the sum 6q — Si in one pass, using only one inverse transform per block 
(see the remark after Lemma [TJ . This is possible since i5o is already completely 
known, and constitutes the main source of savings over Bernstein's algorithm. 



Algorithm 2: Reciprocal 



Input: s e Z, s > 1 

/ € C[ar], / = 1 mod x 
9[o] = (/[o]) _1 mod X 
Output: g = g [Q] + ■■■+ g [3s _ 1} X 3 "- 1 = f- 1 mod X 3s 

1 Compute T 2m {g[o\) 

2 for < i < 3s do compute ^-2m(/[i]) 

3 for 1 < k < s do 



1> - ((/[o] + • • • + f[k]X k )(g m + ■■■ + g^-i^- 1 ))^ 
Compute T2 m {^) 

9[k] < 9[o]^ mod X 

Compute T 2m {9[k}) 



8 for < k < s do 

9 d [k] < ((/[o] + • • • + /[3 S -l]^ 3s_1 )(5[0] + • • • + .9[ s -i]A s " 1 )) [fe+;j ] 

10 Compute ^midm) 

n for s < fc < 2s do 

d[k] *~ ((^[o] + 1- ^[s-i]A s_1 ) 2 )[ fe _ s ] 

- ((/[o] + • ■ • + /pte-i]* 3 - 1 )^] + ' ' ' + ffi-i]^- 1 ))^ 
Compute ^2m(^[fe]) 
is for s < k < 3s do 

is |_ 9[ k ] *- ((d[ ] H h rf[2 S -i]^ 2s_1 )(5[0] H 1" 5[ s -i]^ s_1 ))[fe- s ] 



12 
13 
14 



Proposition 4. Algorithm^ is correct, and runs in time (13s — 3)T(2m) + 0(s 2 m). 

Proof. By definition g^ is correct. Lines [3] [7] compute <7[i], . . . , 9[ s -i] as follows. In 
the fcth iteration of the loop, assume that gm\, . . . ,g\k-i] are correct. Then 

(/[o] + ■ • • + f[k]X k )(g [0] + ■■■ + g {k -i]X k - x ) = l + 4>X k mod X k+ \ 

(/[0] + • • • + /[ & ]* fc )G?[o] + • • • + 9[k]X k ) = 1 mod X k+1 . 
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Subtracting yields /[olflTfel = — V* mod X, so g\u is computed correctly. (This loop is 
essentially van der Hoeven's division algorithm, applied to compute 1/ f mod X s .) 

Now we use the symbols 5, So, Si introduced earlier, putting n = sm. After lines 
[5HTU1 we have dm] + ■ ■ ■ + d[ s _ 1 ]X s ~ 1 — —So, and the subsequent loop computes 

d[ s ] H h d^s-^X 3 - 1 = 81 - Si mod X s . Therefore d [0] H h d^^X 2 ^ 1 = 

—S + 5 2 X S mod X 2s . The final loop computes the appropriate blocks of g' = 
g(l - SX S + 8 2 X 2s ) mod X 3s . 

Altogether the algorithm performs 1 + 3s + 2(s — 1) + s + s forward transforms, 
2(s — l) + s + s + 2s inverse transforms, and 0(s 2 m) scalar operations (apply Lemma 
Q]to each loop). □ 

Theorem 5. The reciprocal of a power series f e C[xJ may be computed to order 
n in time (13/9 + o(l))M(n). 

Proof. Apply the proof of Theorem [3] to Proposition [H with r = 3s. □ 

Remark. If g^ is computed using a (1.5 + o(l))M (n) algorithm, the new algorithm 
achieves the same bound for s = 3 (r = 9), and is faster for s > 4 (r > 12). 

Remark. A natural question is whether the key idea of Algorithm[2]can be extended 
to Newton iterations of arbitrarily high order. That is, if fg = 1 + Sx n , is it 
possible to compute 1 — 8x n + S 2 x 2n ■ ■ ■ ± S k ~ 1 x^- 1 )™ mod x kn in essentially the 
same time as 1 + Sx n mod x kn itself, for arbitrary fc? Algorithm [2] corresponds to 
the case k = 3. An affirmative answer for arbitrary k would presumably lead to a 
(1.333 . . . + o(l))M(n) algorithm for the reciprocal. 
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